Hey @leezu , Thanks for submitting the PR
CI supported jobs: [windows-gpu, unix-cpu, centos-cpu, sanity, unix-gpu, website, clang, windows-cpu, edge, miscellaneous, centos-gpu] Note: |
  if 'CUDA_PATH' not in os.environ:
-     os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v9.2"
+     os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"
Please revert the CUDA change. That does not represent a fix.
It's a requirement for VS2019
Visual Studio usually allows more than one CUDA version. You just have to install the respective toolkit.
What do you mean? Cuda 9.2 does not support VS 2019
Oh, they intentionally broke that backwards-compatibility feature. Previously it was possible to build against older CUDA versions with newer VS versions by installing the toolkits that provide the integration. It seems that caused issues, so Microsoft and Nvidia decided not to go that route any further.
In that case, fine to proceed.
But is there still some kind of compatibility mode which checks that the code is still compliant with older CUDA standards?
> But is there still some kind of compatibility mode which checks that the code is still compliant with older CUDA standards?

No. The Unix and CentOS jobs test CUDA 10.1. Windows now tests CUDA 10.2.
But the risk is low that we'd break CUDA 9 support within the next few days, so it's not a one-way-door decision. I suggest we discuss on dev whether we want to support CUDA 9. If we decide to support it, let's switch the CentOS tests to use CUDA 9.
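For readers following the `CUDA_PATH` change in this thread, here is a minimal sketch of the fallback pattern shown in the diff: the default is applied only when the variable is not already set, so a machine with a different toolkit can override it. The helper name `ensure_cuda_path` and the constant `DEFAULT_CUDA_PATH` are hypothetical and not taken from `build_windows.py`; only the path string comes from the diff.

```python
import os

# Default toolkit location, copied from the diff under review.
DEFAULT_CUDA_PATH = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"


def ensure_cuda_path(env, default=DEFAULT_CUDA_PATH):
    """Set CUDA_PATH in `env` only if it is missing, then return its value.

    `env` is any mutable mapping, for example `os.environ` or a plain dict.
    (Hypothetical helper for illustration only.)
    """
    env.setdefault("CUDA_PATH", default)
    return env["CUDA_PATH"]


if __name__ == "__main__":
    # An existing CUDA_PATH (for example v9.2 on an older machine) is kept as-is.
    print(ensure_cuda_path(os.environ))
```

Passing the mapping in explicitly keeps the helper easy to exercise with a plain dict, without touching the real process environment.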
Connectivity issues became a problem after switching to Windows Server 2019. The switch was done because the old AMI apparently can't be rebuilt anymore and a new AMI had to be created. I was not involved in that effort, but I think it's reasonable to get the Windows AMI instructions working again and choose the latest Windows Server for doing that. Jenkins connectivity issues typically result from system or network load problems. The newer version of Windows may have some issues causing network problems on a slower machine such as a g3. If you have an alternative fix, please propose it. Moving to a g4 instance to run the Windows GPU tests resolved the connectivity issue.
Well, the new AMI should not have been moved into production then. Sorry, but changing a hundred knobs to facilitate one change isn't right. Either get a stable replacement and deploy that, or leave things as they are. Replacing an existing system with an inferior version does not make sense to me. These are standards which I do not see as aligned with the project's interest.
Could you elaborate on how the distinction between running tests on a g3 or g4 does not align with the project's interest?
Based on offline discussion with Marco, let's use a patched version of the old AMI first to fix the CI. @josephevans helped to install VS 2019 on the old AMI. I have further reduced the diff of this PR to include only the minimal changes to switch to VS 2019 and the x64 toolchain. If this fixes the issue, we'll update to the new AMI with updated CUDA and g4 instances at a later point, after running it in the dev environment for a while.
Can you please summarize the changes and the reasoning in the PR description?
For example, I think it is important to note why you're bumping the CUDA version; details like this shouldn't be buried deep in the PR's conversation. Anyone researching what happened here might have a hard time. Also, when setting up Windows for yourself, you might want to see how CI does it and why.
So are we going to x version of VS as a default?
And what's this cmake change?
Done
#17808 will provide an updated setup. This PR is only an emergency fix.
@marcoabreu The GPU build is still flaky due to thrust + VS2019 issues. Adding back the retries. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17962/17/pipeline
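The retry approach mentioned above can be sketched as follows. This is an illustrative sketch and not the code in `build_windows.py`: the function name `run_with_retries`, its parameters, and the injectable `runner` are all assumptions made for the example. It reruns a build command until it exits with status 0 and raises once every attempt has failed, so that a build which never succeeds still fails the job.

```python
import subprocess
import time


def run_with_retries(cmd, retries=5, delay=0.0, runner=subprocess.call):
    """Run `cmd` until it exits with status 0, trying at most `retries` times.

    Returns the 1-based number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    `runner` defaults to subprocess.call and can be replaced in tests.
    (Hypothetical helper for illustration only.)
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        if runner(cmd) == 0:
            return attempt
        # Optionally pause before the next attempt, but not after the last one.
        if delay and attempt < retries:
            time.sleep(delay)
    raise RuntimeError("command failed after {} attempts: {}".format(retries, cmd))
```

A call such as `run_with_retries(["ninja"], retries=5)` would rerun the build on a flaky compiler failure. Retries only mask the flakiness; they do not address its cause.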
Update Windows CI to use VS 2019 and enable the 64-bit (x64) toolchain. Previously we were using an older 32-bit toolchain, which caused OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in apache#17912 and did not work. Update to CUDA 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove the logic to install cmake 3.16 on every PR, as cmake 3.17 is now preinstalled. Add build retries due to CUDA thrust + VS2019 flakiness. Co-authored-by: vexilligera
* Fix Windows GPU CI (#17962)
  Update Windows CI to use VS 2019 and enable the 64-bit (x64) toolchain. Previously we were using an older 32-bit toolchain, which caused OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to CUDA 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove the logic to install cmake 3.16 on every PR, as cmake 3.17 is now preinstalled. Add build retries due to CUDA thrust + VS2019 flakiness.
  Co-authored-by: vexilligera
* backport mixed type
Co-authored-by: Leonard Lausen
Co-authored-by: vexilligera
* Fix einsum gradient (#18482)
* [v1.7.x] Backport PRs of numpy features (#18653)
* add zero grad for npi_unique (#18080)
* fix np.clip scalar input case (#17788)
* fix true_divide (#18393)
  Co-authored-by: Hao Jin
  Co-authored-by: Xi Wang
* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)
* Fix Windows GPU CI (#17962)
  Update Windows CI to use VS 2019 and enable the 64-bit (x64) toolchain. Previously we were using an older 32-bit toolchain, which caused OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to CUDA 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove the logic to install cmake 3.16 on every PR, as cmake 3.17 is now preinstalled. Add build retries due to CUDA thrust + VS2019 flakiness.
  Co-authored-by: vexilligera
* backport mixed type
  Co-authored-by: Leonard Lausen
  Co-authored-by: vexilligera
* revise activations (#18700)
* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)
* Fix the monitor_callback invalid issue during calibration with variable input shapes
* retrigger CI
* Add UT for monitor check and disable codecov
  Co-authored-by: Tao Lv
* Fail build_windows.py if all retries failed (#18177)
* Update to thrust 1.9.8 on Windows (#18218)
* Update to thrust 1.9.8 on Windows
* Remove debug logic
* Re-enable build retries on MSVC (#18230)
  Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub.
  Tracked upstream at NVIDIA/thrust#1090
Co-authored-by: Ke Han
Co-authored-by: Xingjian Shi
Co-authored-by: Hao Jin
Co-authored-by: Xi Wang
Co-authored-by: Yijun Chen
Co-authored-by: vexilligera
Co-authored-by: ciyong
Co-authored-by: Tao Lv
Description
Minimal version of #17808
Update Windows CI to use VS 2019 and enable the 64-bit (x64) toolchain. Previously we were using an older 32-bit toolchain, which caused OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to CUDA 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove the logic to install cmake 3.16 on every PR, as cmake 3.17 is now preinstalled.
CC: @marcoabreu
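To make the generator switch in the description concrete, here is a hedged sketch of how a build script might assemble the CMake configure command. The function `cmake_configure_command` and its parameters are hypothetical and do not reflect the actual contents of `build_windows.py`. The CMake options themselves (`-G Ninja`, `-G "Visual Studio 16 2019"`, `-A x64`, `-T host=x64`) are standard CMake command-line options. With Ninja, the 64-bit compiler is assumed to come from the environment the script runs in (for example a shell initialised with `vcvars64.bat`), since Ninja uses whichever compiler is on `PATH`.

```python
def cmake_configure_command(source_dir, build_type="Release",
                            use_ninja=True, extra_flags=()):
    """Return the CMake configure command as an argument list.

    (Hypothetical helper for illustration only.)
    """
    cmd = ["cmake"]
    if use_ninja:
        # Single-config generator; the compiler is taken from the environment.
        cmd += ["-G", "Ninja"]
    else:
        # VS 2019 generator: target x64 and use the 64-bit host tools,
        # which avoids the memory limits of the 32-bit linker.
        cmd += ["-G", "Visual Studio 16 2019", "-A", "x64", "-T", "host=x64"]
    # CMAKE_BUILD_TYPE is honoured by Ninja; multi-config generators such as
    # Visual Studio ignore it and select the configuration at build time.
    cmd.append("-DCMAKE_BUILD_TYPE={}".format(build_type))
    cmd += list(extra_flags)
    cmd.append(source_dir)
    return cmd
```

Returning a list rather than a single string lets the caller pass it to `subprocess` without shell quoting, which matters on Windows where paths often contain spaces.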